Online Optimization: Competing with Dynamic Comparators

Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour and Karthik Sridharan


Abstract 

Recent literature on online learning has focused on developing adaptive algorithms that take advantage of
regularity in the sequence of observations, yet retain worst-case performance guarantees. A complementary direction
is to develop prediction methods that perform well against complex benchmarks. In this paper, we address these two
directions together. We present a fully adaptive method that competes with dynamic benchmarks, and whose regret
guarantee scales with the regularity of the sequence of cost functions and comparators. Notably, the regret bound adapts
to the smaller complexity measure in the problem environment. Finally, we apply our results to drifting zero-sum,
two-player games where both players achieve no-regret guarantees against the best sequences of actions in hindsight.

I. Introduction 

The focus of this paper is an online optimization problem in which a learner plays against an adversary or
nature. At each round $t \in \{1,\ldots,T\}$, the learner chooses an action $x_t$ from some convex feasible set
$\mathcal{X} \subseteq \mathbb{R}^d$. Then, nature reveals a convex function $f_t$ to the learner. As a result, the
learner incurs the corresponding loss $f_t(x_t)$. The learner aims to minimize regret, a comparison to a single best
action in hindsight:

$$\mathrm{Reg}_T^s = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \tag{1}$$

Let us refer to this as static regret in the sense that the comparator is time-invariant. In the literature, there are
numerous algorithms that guarantee a static regret rate of $O(\sqrt{T})$ (see e.g. Ql-El). Moreover, when the loss
functions are strongly convex, a rate of $O(\log T)$ can be achieved d?). Furthermore, minimax optimality of
algorithms with respect to the worst-case adversary has been established (see e.g. S).

There are two major directions in which the above-mentioned results can be strengthened: (1) by exhibiting 
algorithms that compete with non-static comparator sequences (that is, making the benchmark harder), and (2) 
by proving regret guarantees that take advantage of niceness of nature’s sequence (that is, exploiting some
non-adversarial quality of nature’s moves). Both of these distinct directions are important avenues of investigation. In
the present paper, we attempt to address these two aspects by developing a single, adaptive algorithm with a regret 
bound that shows the interplay between the difficulty of the comparison sequence and niceness of the sequence of 
nature’s moves. 

With respect to the first aspect, a more stringent benchmark is a time-varying comparator, a notion that can be
termed dynamic regret a, 0-0:

$$\mathrm{Reg}_T^d = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*), \tag{2}$$

where $x_t^* = \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$. More generally, dynamic regret against a comparator
sequence $\{u_t\}_{t=1}^{T}$ is

$$\mathrm{Reg}_T^d(u_1,\ldots,u_T) = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t).$$

It is well-known that in the worst case, obtaining a bound on dynamic regret is not possible. However, it is possible
to achieve worst-case bounds in terms of

$$C_T(u_1,\ldots,u_T) = \sum_{t=1}^{T} \|u_t - u_{t-1}\|, \tag{3}$$

i.e., the regularity of the comparator sequence, interpolating between the static and dynamic regret notions.
Furthermore, the authors in 0 introduce an algorithm which proposes a variant of $C_T$ involving a dynamical model.
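
To make the three quantities above concrete, the following sketch evaluates static regret, dynamic regret and the path length $C_T$ of the minimizer sequence on a toy instance. The quadratic losses $f_t(x) = (x - \theta_t)^2$ on $[-1, 1]$, the function name `regret_summary`, and the convention of summing the path length from $t = 2$ are assumptions made purely for illustration; none of them comes from the paper.

```python
import numpy as np


def regret_summary(theta, x_played, lo=-1.0, hi=1.0):
    """Static regret, dynamic regret and C_T for f_t(x) = (x - theta_t)^2 on [lo, hi].

    The loss family and the interval are illustrative assumptions.
    """
    theta = np.asarray(theta, dtype=float)
    x_played = np.asarray(x_played, dtype=float)
    learner_loss = np.sum((x_played - theta) ** 2)

    # Per-round minimizers x_t^*: projections of theta_t onto the interval.
    x_star = np.clip(theta, lo, hi)
    dynamic_benchmark = np.sum((x_star - theta) ** 2)

    # Best fixed action: the projected mean minimizes the summed quadratic.
    x_fixed = np.clip(theta.mean(), lo, hi)
    static_benchmark = np.sum((x_fixed - theta) ** 2)

    # Path length of the minimizer sequence, summed from t = 2 (assumed convention).
    path_length = np.sum(np.abs(np.diff(x_star)))

    return {
        "static": learner_loss - static_benchmark,
        "dynamic": learner_loss - dynamic_benchmark,
        "C_T": path_length,
    }


summary = regret_summary(theta=[0.5, -0.5, 0.5, -0.5], x_played=[0.0, 0.0, 0.0, 0.0])
```

On this alternating instance a learner that always plays 0 matches the best fixed action, yet it still pays against the per-round minimizers, whose path length grows with the number of switches.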

In terms of the second direction, there are several ways of incorporating potential regularity of nature’s sequence.
The authors in noi, mil bring forward the idea of predictable sequences - a generic way to incorporate some
external knowledge about the gradients of the loss functions. Let $\{M_t\}_{t=1}^{T}$ be a predictable sequence computable
by the learner at the beginning of round $t$. This sequence can then be used by an algorithm in order to achieve
regret in terms of

$$D_T = \sum_{t=1}^{T} \|\nabla f_t(x_t) - M_t\|_*^2. \tag{4}$$

The framework of predictable sequences captures variation and path-length type regret bounds (see e.g. m, HI).
Yet another way in which niceness of the adversarial sequence can be captured is through a notion of temporal
variability studied in iflTll :

$$V_T = \sum_{t=1}^{T} \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)|. \tag{5}$$

What is interesting, and intuitive, is that dynamic regret against the optimal sequence $\{x_t^*\}_{t=1}^{T}$ becomes a
feasible objective when $V_T$ is small. When only noisy versions of gradients are revealed to the algorithm, Besbes et al.
in HI show that using a restarted Online Gradient Descent (OGD) 0 algorithm, one can get a bound of the form
$O(T^{2/3}(V_T + 1)^{1/3})$ on the expected regret. However, the regret bounds attained in lfT4l are only valid when an upper
bound on $V_T$ is known to the learner before the game begins. For the full information online convex optimization


January 27, 2015 


DRAFT 




setting, when one receives exact gradients instead of noisy gradients, a bound of order $V_T$ is trivially obtained by
simply playing (at each round) the minimizer of the previous round.
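
The claim about playing the previous minimizer can be checked with a short calculation. The derivation below is a sketch that we add for the reader; it assumes the learner plays $x_t = x_{t-1}^*$ for $t \ge 2$ and leaves the first round aside.

```latex
\begin{align*}
f_t(x_{t-1}^*) - f_t(x_t^*)
  &= \big[f_t(x_{t-1}^*) - f_{t-1}(x_{t-1}^*)\big]
   + \big[f_{t-1}(x_{t-1}^*) - f_{t-1}(x_t^*)\big]
   + \big[f_{t-1}(x_t^*) - f_t(x_t^*)\big] \\
  &\le 2 \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)| ,
\end{align*}
```

The middle bracket is non-positive because $x_{t-1}^*$ minimizes $f_{t-1}$, and each of the other two brackets is at most the supremum. Summing over $t \ge 2$ bounds the dynamic regret of this strategy by $2V_T$, up to the loss of the first round.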

The three quantities we just introduced, $C_T$, $D_T$ and $V_T$, measure distinct aspects of the online optimization
problem, and their interplay is an interesting object of study. Our first contribution is to develop a fully adaptive
method (without prior knowledge of these quantities) whose dynamic regret is given in terms of these three
complexity measures. This is done for the full information online convex optimization setting, and augments the
existing regret bounds in the literature, which focus on only one of the three notions $C_T$, $D_T$, $V_T$ (and not
all three together). To establish a sub-linear bound on the dynamic regret, we utilize a variant of the Optimistic
Mirror Descent (OMD) algorithm ifTOll .

When noiseless gradients are available and we can calculate variations at each round, we not only establish a
regret bound in terms of $V_T$ and $T$ (without a priori knowledge of a bound on $V_T$), but also show how the bound
can in fact be improved when the deviation $D_T$ is $o(T)$. We further show how the bound can automatically adapt
to $C_T$, the length of the sequence of comparators. Importantly, this avoids suboptimal bounds derived only in terms of
one of the quantities $C_T$, $V_T$ in an environment where the other one is small.

The second contribution of this paper is the technical analysis of the algorithm. The bound on the dynamic regret
is derived by applying the doubling trick to a non-monotone quantity, which results in a non-monotone step size
sequence (which, to the best of the authors’ knowledge, has not been investigated before).

We also provide uncoupled strategies for two players playing a sequence of drifting zero-sum games. We show that
when the two players play the provided strategies, their payoffs converge to the average minimax value of the
sequence of games (provided the games drift slowly). In this case, both players simultaneously enjoy no-regret
guarantees against best sequences of actions in hindsight that vary slowly. This is a generalization of the results by
Daskalakis et al. ca, and Rakhlin et al. O, both of which are for fixed games played repeatedly.

II. Preliminaries and Problem Formulation 

A. Notation 

Throughout the paper, we assume that for any action $x \in \mathcal{X} \subseteq \mathbb{R}^d$ at any time $t$, it holds that

$$|f_t(x)| \le G. \tag{6}$$

We denote by $\|\cdot\|_*$ the dual norm of $\|\cdot\|$, by $[T]$ the set of natural numbers $\{1,\ldots,T\}$, and by $f_{1:t}$ the shorthand
of $f_1,\ldots,f_t$. Whenever $C_T$ is written without arguments, it will refer to the regularity $C_T(x_1^*,\ldots,x_T^*)$ of
the sequence of minimizers of the loss functions. We point out that our initial statements hold for the regularity
of any sequence of comparators. However, for upper bounds involving $\sqrt{C_T}$, one needs to choose a computable
quantity to tune the step size, and hence our main results are stated for $C_T(x_1^*,\ldots,x_T^*)$.

The quantity $D_T$ is defined with respect to an arbitrary predictable sequence $\{M_t\}_{t=1}^{T}$, but this dependence is
omitted for brevity.
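
As an illustration of how the choice of predictable sequence affects $D_T$, the sketch below computes the deviation for two choices: $M_t = 0$ and $M_t = \nabla_{t-1}$. The helper names, the Euclidean norm (which is its own dual) and the convention $M_1 = 0$ are assumptions of this example, not part of the paper.

```python
import numpy as np


def deviation(grads, predictions):
    """D_T = sum_t ||grad_t - M_t||^2 in the Euclidean norm (assumed for illustration)."""
    g = np.asarray(grads, dtype=float)
    m = np.asarray(predictions, dtype=float)
    return float(np.sum((g - m) ** 2))


def last_gradient_predictions(grads):
    """M_t = grad_{t-1}, with M_1 = 0 as an assumed convention for the first round."""
    g = np.asarray(grads, dtype=float)
    return np.vstack([np.zeros((1, g.shape[1])), g[:-1]])


# Five rounds with an unchanging gradient.
grads = np.tile(np.array([1.0, 0.0]), (5, 1))
d_zero = deviation(grads, np.zeros_like(grads))
d_last = deviation(grads, last_gradient_predictions(grads))
```

With constant gradients, the zero prediction accumulates one unit of deviation per round, whereas the last-gradient prediction only pays in the first round.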
B. Comparing with existing regret bounds in the dynamic setting

We state and discuss relevant results from the literature on online learning in dynamic environments. For any
comparator sequence $\{u_t\}_{t=1}^{T}$ and the specific minima sequence $\{x_t^*\}_{t=1}^{T}$, the following results are
established in the literature:


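Reference | Regret Notion | Regret Rate
0, El | $\sum_{t=1}^{T} f_t(x_t) - f_t(u_t)$ | $O\big(\sqrt{T}\,(1 + C_T(u_1,\ldots,u_T))\big)$
Ql | $\sum_{t=1}^{T} \mathbb{E}[f_t(x_t)] - f_t(x_t^*)$ | $O\big(T^{2/3}(1 + V_T)^{1/3}\big)$
im | $\sum_{t=1}^{T} f_t(x_t) - f_t(u)$ | $O\big(\sqrt{D_T}\big)$
Our work | $\sum_{t=1}^{T} f_t(x_t) - f_t(x_t^*)$ | $\tilde{O}\big(\sqrt{D_T + 1} + \min\{\sqrt{(D_T + 1)C_T},\ (D_T + 1)^{1/3} T^{1/3} V_T^{1/3}\}\big)$

where $\tilde{O}(\cdot)$ hides the $\log T$ factor. Lemma 1 below also yields a rate of $O\big(\sqrt{D_T + 1}\,(1 + C_T(u_1,\ldots,u_T))\big)$ for any
comparator sequence $\{u_t\}_{t=1}^{T}$. A detailed explanation of the bounds will be given after Theorem 3.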

We remark that the authors in Dl consider a setting in which a variation budget (an upper bound on $V_T$)
is known to the learner, but he/she only has noisy gradients available. Then, the restarted OGD guarantees the
mentioned rate for convex functions; the rate is modified to $\sqrt{(V_T + 1)T}$ for strongly convex functions.

For the case of noiseless gradients, we first aim to show that our algorithm is adaptive in the sense that the
learner need not know an upper bound on $V_T$ in advance when he/she can calculate variations observed so far.
Furthermore, we shall establish that our method recovers the known bounds for stationary settings (as well as cases
where $V_T$ does not change gradually along the time horizon).


C. Comparison of Regularity and Variability 


We now show that $V_T$ and $C_T$ are not comparable in general. To this end, we consider the classical problem
of prediction with expert advice. In this setting, the learner deals with the linear loss $f_t(x) = \langle f_t, x \rangle$ on the
$d$-dimensional probability simplex. Assume that for any $t \ge 1$, we have the vector sequence

$$f_t = \begin{cases} \left(-\tfrac{1}{T}, 0, 0, \ldots, 0\right), & \text{if } t \text{ even} \\ \left(0, -\tfrac{1}{T}, 0, \ldots, 0\right), & \text{if } t \text{ odd.} \end{cases}$$

Setting $u_t$, the comparator of round $t$, to be the minimizer of $f_t$, i.e. $u_t = x_t^*$, we have

$$C_T = \sum_{t=1}^{T} \|x_t^* - x_{t-1}^*\|_1 = \Theta(T), \qquad V_T = \sum_{t=1}^{T} \|f_t - f_{t-1}\|_\infty = O(1),$$

according to (3) and (5), respectively. We see that $V_T$ is considerably smaller than $C_T$ in this scenario. On the
other hand, consider prediction with expert advice with two experts. Let $f_t = (-1/2, 0)$ on even rounds and
$f_t = (0, 1/2)$ on odd rounds. Expert 1 remains the best throughout the game, and thus $C_T = O(1)$,
while the variation is $V_T = \Theta(T)$. Therefore, one can see that taking into account only one measure might lead us to
suboptimal regret bounds. We show that both measures play a key role in our regret bound. Finally, we note that if
$M_t = \nabla f_{t-1}(x_{t-1})$, the notion of $D_T$ can be related to $V_T$ in certain cases, yet we keep the predictable sequence
arbitrary and thus as playing a role separate from $V_T$ and $C_T$.
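
The two expert-advice sequences can be checked numerically. The sketch below computes the $\ell_1$ path length of the per-round best experts and the sum of $\ell_\infty$ differences of consecutive loss vectors. The function names, the choice $d = 3$, the reading of the nonzero entry in the first example as $-1/T$, and the convention of summing from $t = 2$ are assumptions of this illustration.

```python
import numpy as np


def regularity_and_variability(loss_vectors):
    """Return (C_T, V_T) for linear losses <f_t, x> on the probability simplex."""
    F = np.asarray(loss_vectors, dtype=float)
    # The minimizer of a linear loss on the simplex is the vertex of the smallest
    # coordinate (ties broken by lowest index).
    best = np.argmin(F, axis=1)
    vertices = np.eye(F.shape[1])[best]
    # l1 path length of the minimizer sequence, summed from t = 2.
    C = float(np.abs(np.diff(vertices, axis=0)).sum())
    # For linear losses on the simplex, sup_x |f_t(x) - f_{t-1}(x)| = ||f_t - f_{t-1}||_inf.
    V = float(np.abs(np.diff(F, axis=0)).max(axis=1).sum())
    return C, V


def example_one(T, d=3):
    """Alternating tiny advantage between experts 1 and 2 (magnitude 1/T assumed)."""
    F = np.zeros((T, d))
    for t in range(1, T + 1):
        F[t - 1, 0 if t % 2 == 0 else 1] = -1.0 / T
    return F


def example_two(T):
    """Two experts: expert 1 is always (weakly) best but the losses swing every round."""
    return np.array([(-0.5, 0.0) if t % 2 == 0 else (0.0, 0.5) for t in range(1, T + 1)])


C1, V1 = regularity_and_variability(example_one(100))
C2, V2 = regularity_and_variability(example_two(100))
```

In the first sequence the best expert switches every round while the losses barely move; in the second the best expert never changes while the losses move by a constant each round.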


III. Main Results 

A. Optimistic Mirror Descent and Relation to Regularity 

We now outline the OMD algorithm previously proposed in Go). Let $\mathcal{R}$ be a 1-strongly convex function with
respect to a norm $\|\cdot\|$, and let $D_{\mathcal{R}}(\cdot,\cdot)$ represent the Bregman divergence with respect to $\mathcal{R}$. Also, let $\mathcal{H}_t$ be the set
containing all information available to the learner at the beginning of time $t$. Then, the learner can compute the
vector $M_t : \mathcal{H}_t \to \mathbb{R}^d$, which we call the predictable process. Supposing that the learner has access to the side
information $M_t \in \mathbb{R}^d$ from the outset of round $t$, the OMD algorithm is characterized via the following interleaved
sequence,

$$x_t = \operatorname{argmin}_{x \in \mathcal{X}} \big\{ \eta_t \langle x, M_t \rangle + D_{\mathcal{R}}(x, \hat{x}_{t-1}) \big\} \tag{7}$$

$$\hat{x}_t = \operatorname{argmin}_{x \in \mathcal{X}} \big\{ \eta_t \langle x, \nabla_t \rangle + D_{\mathcal{R}}(x, \hat{x}_{t-1}) \big\}, \tag{8}$$

where $\nabla_t = \nabla f_t(x_t)$, and $\eta_t$ is the step size that can be chosen adaptively to attain low regret. One could observe
that for $M_t = 0$, the OMD algorithm amounts to the well-known Mirror Descent algorithm m, ini. On the other
hand, the special case of $M_t = \nabla_{t-1}$ recovers the scheme proposed in ifTSl . It is shown in ifTOl that the static regret
satisfies

$$\mathrm{Reg}_T^s \le 4 R_{\max} \big( \sqrt{D_T} + 1 \big),$$

using the step size

$$\eta_t = R_{\max} \min\left\{ \left( \sqrt{\textstyle\sum_{s=1}^{t-1} \|\nabla_s - M_s\|_*^2} + \sqrt{\textstyle\sum_{s=1}^{t-2} \|\nabla_s - M_s\|_*^2} \right)^{-1},\ 1 \right\},$$

where $R_{\max}^2 = \sup_{x,y \in \mathcal{X}} D_{\mathcal{R}}(x,y)$. The following lemma extends the result to an arbitrary sequence of comparators
$\{u_t\}_{t=1}^{T}$. Throughout, we assume that $\|\nabla_0 - M_0\|_*^2 = 1$ by convention.

Lemma 1. Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$. Let $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ be a 1-strongly convex function on $\mathcal{X}$ with
respect to a norm $\|\cdot\|$, and let $\|\cdot\|_*$ denote the dual norm. For any $L > 0$, employing the time-varying step size

$$\eta_t = \frac{L}{\sqrt{\sum_{s=0}^{t-1} \|\nabla_s - M_s\|_*^2} + \sqrt{\sum_{s=0}^{t-2} \|\nabla_s - M_s\|_*^2}},$$

and running the Optimistic Mirror Descent algorithm for any comparator sequence $\{u_t\}_{t=1}^{T}$, yields

$$\mathrm{Reg}_T^d(u_1,\ldots,u_T) \le 2\sqrt{1 + D_T} \left( L + \frac{\gamma\, C_T(u_1,\ldots,u_T) + 4R_{\max}^2}{L} \right),$$

so long as $D_{\mathcal{R}}(x,z) - D_{\mathcal{R}}(y,z) \le \gamma \|x - y\|$, $\forall x,y,z \in \mathcal{X}$.

Lemma 1 underscores the fact that one can get a tighter bound for regret once the learner advances a sequence
of conjectures $\{M_t\}_{t=1}^{T}$ well-aligned with the gradients. Moreover, if the learner has prior knowledge of $C_T$ (or
an upper bound on it), then the regret bound would be $O\big(\sqrt{(D_T + 1)C_T}\big)$ by tuning $L$.

Note that when the function $\mathcal{R}$ is Lipschitz on $\mathcal{X}$, the Lipschitz condition on the Bregman divergence is
automatically satisfied. For the particular case of KL divergence this can be achieved via mixing a uniform
distribution to stay away from boundaries (see e.g. section 4.2 of the paper in this regard). In this case, the
constant $\gamma$ is of $O(\log T)$.
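
For intuition, here is a minimal sketch of the interleaved updates (7)-(8) with the step size of Lemma 1 in the Euclidean case. It assumes $\mathcal{R}(x) = \tfrac{1}{2}\|x\|_2^2$, so that both argmin steps reduce to projected gradient steps, a Euclidean ball as the feasible set, and $M_t = \nabla_{t-1}$ (or $M_t = 0$) as the predictable sequence. The function name `omd_euclidean` and all of its parameters are hypothetical.

```python
import numpy as np


def omd_euclidean(grad_fns, x0, L, radius=1.0, optimistic=True):
    """Optimistic Mirror Descent sketch with R(x) = ||x||^2 / 2 on a Euclidean ball.

    grad_fns[t] maps the played point to the gradient of that round's loss.
    With optimistic=True the prediction is the last gradient; otherwise M_t = 0.
    """
    def project(z):
        norm = np.linalg.norm(z)
        return z if norm <= radius else z * (radius / norm)

    x_hat = project(np.asarray(x0, dtype=float))
    M = np.zeros_like(x_hat)
    # Running sums S_{t-1}, S_{t-2} of ||grad_s - M_s||^2, using the
    # convention ||grad_0 - M_0||^2 = 1 from the text.
    s_prev, s_prev2 = 1.0, 0.0
    plays, etas = [], []
    for grad_fn in grad_fns:
        eta = L / (np.sqrt(s_prev) + np.sqrt(s_prev2))
        x = project(x_hat - eta * M)          # update (7): step along the prediction
        g = np.asarray(grad_fn(x), dtype=float)
        x_hat = project(x_hat - eta * g)      # update (8): step along the true gradient
        plays.append(x)
        etas.append(eta)
        s_prev2, s_prev = s_prev, s_prev + float(np.sum((g - M) ** 2))
        M = g if optimistic else np.zeros_like(g)
    return np.array(plays), np.array(etas)


theta = np.array([0.3, -0.2])
rounds = [lambda x, th=theta: x - th] * 20   # gradient of ||x - theta||^2 / 2
plays, etas = omd_euclidean(rounds, x0=[0.0, 0.0], L=1.0)
```

Because the accumulated deviation never decreases, the step sizes produced by this rule are non-increasing within a single run.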

B. The Adaptive Optimistic Mirror Descent Algorithm 

The main objective of the paper is to develop the Adaptive Optimistic Mirror Descent (AOMD) algorithm. The
AOMD algorithm incorporates all notions of variation $D_T$, $C_T$ and $V_T$ to derive a comprehensive regret bound. The
proposed method builds on the OMD algorithm with adaptive step size, combined with a doubling trick applied to
a threshold growing non-monotonically (see e.g. m, nni for application of doubling trick on monotone quantities).
The scheme is adaptive in the sense that no prior knowledge of $D_T$, $C_T$ or $V_T$ is necessary.

Observe that the prior knowledge of a variation budget (an upper bound on $V_T$) does not tell us how the changes
between cost functions are distributed throughout the game. For instance, the variation can increase gradually along
the time horizon, while it can also take place in the form of discrete switches. The learner does not have any
information about the variation pattern. Therefore, she must adopt a flexible strategy that achieves low regret in the
benign case of finite switches or shocks, while it is simultaneously able to compete with the worst case of gradual
change. Before describing the algorithm, let us first use Lemma 1 to bound the general dynamic regret in terms of
$D_T$, $C_T$ and $V_T$.

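Lemma 2. Let $\mathcal{X}$ be a convex set in a Banach space $\mathcal{B}$. Let $\mathcal{R} : \mathcal{B} \to \mathbb{R}$ be a 1-strongly convex function on $\mathcal{X}$
with respect to a norm $\|\cdot\|$. Run the Optimistic Mirror Descent algorithm with the step size given in the statement
of Lemma 1. Letting the comparator sequence be $\{u_t\}_{t=1}^{T}$, for any $L > 2R_{\max}$ we have

$$\mathrm{Reg}_T^d(u_1,\ldots,u_T) \le 4\sqrt{1 + D_T}\, L + \mathbf{1}\big\{ \gamma\, C_T(u_1,\ldots,u_T) > L^2 - 4R_{\max}^2 \big\} \frac{4\gamma R_{\max} T V_T}{L^2 - 4R_{\max}^2},$$

so long as $D_{\mathcal{R}}(x,z) - D_{\mathcal{R}}(y,z) \le \gamma \|x - y\|$, $\forall x,y,z \in \mathcal{X}$.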

We now describe the AOMD algorithm, shown in Algorithm 1, and prove that it automatically adapts to $V_T$, $D_T$ and $C_T$.
The algorithm can be cast as a repeated OMD using different step sizes. The learner sets the parameter $L = 3R_{\max}$
in Lemma 1 and runs the OMD algorithm. Along the process, the learner collects the deviation, variation and regularity
observed so far, and checks the doubling condition in Algorithm 1 after each round. Once the condition is satisfied, the
learner doubles $L$, discards the accumulated deviation, variation and regularity, and runs a new OMD algorithm.
Note importantly that the doubling condition results in a non-monotone sequence of step sizes during the learning
process.
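Algorithm 1 Adaptive Optimistic Mirror Descent Algorithm
Parameter : $R_{\max}$, some arbitrary $x_0 \in \mathcal{X}$

Initialize $N = 1$, $C_{(1)} = V_{(1)} = 0$, $D_{(1)} = 1$, $x_1 = x_0$, $L_1 = 3R_{\max}$, $\Delta_1 = 0$ and $k_1 = 1$.
for $t = 1$ to $T$ do

    % check doubling condition
    if $L_N^2 < \gamma \min\{ C_{(N)},\ \Delta_N^{2/3} V_{(N)}^{2/3} D_{(N)}^{-1/3} \} + 4R_{\max}^2$ then

        % increment N and double L
        $N = N + 1$
        $L_N = 3R_{\max} 2^{N-1}$, $C_{(N)} = V_{(N)} = 0$, $D_{(N)} = 1$ and $\Delta_N = 0$
        $k_N = t$

    end if

    Play $x_t$ and suffer loss $f_t(x_t)$

    Calculate $M_{t+1}$ (predictable sequence) and gradient $\nabla_t = \nabla f_t(x_t)$

    % update $D_{(N)}$, $C_{(N)}$, $V_{(N)}$ and $\Delta_N$
    $D_{(N)} = D_{(N)} + \|\nabla_t - M_t\|_*^2$
    $C_{(N)} = C_{(N)} + \|x_t^* - x_{t-1}^*\|$
    $V_{(N)} = V_{(N)} + \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)|$
    $\Delta_N = \Delta_N + 1$

    % set step-size and perform optimistic mirror descent update
    $\eta_{t+1} = L_N \Big/ \Big( \sqrt{D_{(N)}} + \sqrt{D_{(N)} - \|\nabla_t - M_t\|_*^2} \Big)$
    $\hat{x}_t = \operatorname{argmin}_{x \in \mathcal{X}} \big\{ \eta_t \langle x, \nabla_t \rangle + D_{\mathcal{R}}(x, \hat{x}_{t-1}) \big\}$
    $x_{t+1} = \operatorname{argmin}_{x \in \mathcal{X}} \big\{ \eta_{t+1} \langle x, M_{t+1} \rangle + D_{\mathcal{R}}(x, \hat{x}_t) \big\}$

end for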
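
The epoch bookkeeping of Algorithm 1 can be isolated from the mirror descent updates. The sketch below tracks the per-epoch quantities, applies the doubling test and returns the next step size. It assumes the doubling test reads $L_N^2 < \gamma \min\{C_{(N)}, \Delta_N^{2/3} V_{(N)}^{2/3} D_{(N)}^{-1/3}\} + 4R_{\max}^2$, which is our reading of the extracted algorithm; the class name `AOMDEpochs` and its interface are hypothetical, and the caller is assumed to supply the per-round deviation, path increment and variation.

```python
import math


class AOMDEpochs:
    """Bookkeeping for the doubling epochs of AOMD (illustrative sketch only)."""

    def __init__(self, r_max, gamma):
        self.r_max = r_max
        self.gamma = gamma
        self.N = 1
        self.starts = [1]          # k_1, k_2, ...: first round of each epoch
        self._reset()

    def _reset(self):
        # L_N = 3 R_max 2^(N-1); accumulated quantities restart in every epoch.
        self.L = 3.0 * self.r_max * 2 ** (self.N - 1)
        self.C = 0.0
        self.V = 0.0
        self.D = 1.0
        self.delta = 0

    def maybe_double(self, t):
        """Check the (assumed) doubling condition at the start of round t."""
        second = (self.delta * self.V) ** (2.0 / 3.0) * self.D ** (-1.0 / 3.0)
        threshold = self.gamma * min(self.C, second) + 4.0 * self.r_max ** 2
        if self.L ** 2 < threshold:
            self.N += 1
            self._reset()
            self.starts.append(t)
            return True
        return False

    def update(self, dev_sq, path_inc, variation):
        """Accumulate this round's quantities and return the next step size."""
        self.D += dev_sq
        self.C += path_inc
        self.V += variation
        self.delta += 1
        return self.L / (math.sqrt(self.D) + math.sqrt(self.D - dev_sq))


tracker = AOMDEpochs(r_max=1.0, gamma=1.0)
for t in range(1, 51):
    tracker.maybe_double(t)
    eta_next = tracker.update(dev_sq=1.0, path_inc=1.0, variation=1.0)
```

Each reset doubles $L$ while shrinking the accumulated deviation back to its initial value, which is why the resulting step size sequence is not monotone across epochs.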


Notice that once we have completed running the algorithm, $N$ is the number of doubling epochs, $\Delta_i$ is the number
of instances in epoch $i$, $k_i$ and $k_{i+1} - 1$ are the start and end points of epoch $i$, $\sum_{i=1}^{N} \Delta_i = T$,
$\sum_{i=1}^{N} C_{(i)} = C_T$, $\sum_{i=1}^{N} D_{(i)} = D_T + N$ and $\sum_{i=1}^{N} V_{(i)} = V_T$. Also, there is a technical reason for
the initialization choice of $L$ which shall become clear in the proof of Lemma 2. Theorem 3 shows the bound enjoyed by
the proposed AOMD algorithm.

Theorem 3. Assume that $D_{\mathcal{R}}(x,z) - D_{\mathcal{R}}(y,z) \le \gamma \|x - y\|$, $\forall x,y,z \in \mathcal{X}$, and let $C_T = \sum_{t=1}^{T} \|x_t^* - x_{t-1}^*\|$. The
AOMD algorithm enjoys the following bound on dynamic regret :

$$\mathrm{Reg}_T^d \le \tilde{O}\big( \sqrt{D_T + 1} \big) + \tilde{O}\Big( \min\big\{ \sqrt{(D_T + 1)C_T},\ (D_T + 1)^{1/3} T^{1/3} V_T^{1/3} \big\} \Big),$$

where $\tilde{O}(\cdot)$ hides a $\log T$ factor.

Based on Theorem 3 we can obtain the following table that summarizes bounds on $\mathrm{Reg}_T^d$ for various cases
(disregarding the first term $\tilde{O}(\sqrt{D_T + 1})$ in the bound above):


Regime 

Rate 

Ct < T'^^^{Dt + l)-i/3y?/3 

a (v^Ct(19t + 1)) 

E Dt 1 

d{{DT + ifCTC) 

Dt ^ — 1 


Dt = 0{T) 



The following remarks are in order :

• In all cases, given the condition $V_T = o(T)$, the regret is sub-linear. When the gradients are bounded, the
regime $D_T = O(T)$ always holds, guaranteeing the worst-case bound of $\tilde{O}(T^{2/3} V_T^{1/3})$.
• Theorem 3 allows us to recover $O(1)$ regret for certain cases where $V_T = O(1)$. Let nature divide the horizon
into $B$ batches, and play a smooth convex function $f_i(x)$ on each batch $i \in [B]$, that is, for some $H_i > 0$ it
holds that

$$\|\nabla f_i(x) - \nabla f_i(y)\|_* \le H_i \|x - y\|, \tag{9}$$

$\forall i \in [B]$ and $\forall x,y \in \mathcal{X}$. Set $M_t = \nabla f_i(x_{t-1})$ and note that the gradients are Lipschitz continuous. In this
case, the OMD corresponding to each batch can be recognized as the Mirror Prox method ifTSl , which results
in $O(1)$ regret during each period. Also, since $C_T = O(1)$ the bound in Theorem 3 is of $O(\log T)$.

IV. Applications 

A. Competing with Strategies 

So far, we mainly considered dynamic regret $\mathrm{Reg}_T^d$ defined in Equation (2). However, in many scenarios one
might want to consider regret against a more specific set of strategies, defined as follows :

$$\mathrm{Reg}_T^{\Pi} = \sum_{t=1}^{T} f_t(x_t) - \inf_{\pi \in \Pi} \sum_{t=1}^{T} f_t(\pi_t(f_{1:t-1})),$$

where each $\pi \in \Pi$ is a sequence of mappings $\pi = (\pi_1,\ldots,\pi_T)$ and $\pi_t$ maps the history $f_{1:t-1}$ to $\mathcal{X}$. Notice that if $\Pi$ is the set
of all mappings then $\mathrm{Reg}_T^{\Pi}$ corresponds to dynamic regret $\mathrm{Reg}_T^d$, and if $\Pi$ corresponds to the set of constant history
independent mappings, that is, each $\pi \in \Pi$ is indexed by some $x \in \mathcal{X}$ and $\pi_1^x(\cdot) = \ldots = \pi_T^x(\cdot) = x$, then $\mathrm{Reg}_T^{\Pi}$
corresponds to the static regret $\mathrm{Reg}_T^s$. We now define

T 

t=l 

where ttJ' = arginf^gjj Ss=i /s('Ts(/i:s-i))- Assume that there exists sequence of mappings Ci,..., Ct where Ct 
maps any fi,ft to reals and is such that for any t and any fi,, ft-i, 

Ct-l{fl:t-l) < Ct{fl-.t), 




and further, for any T and any /i, -. -, /t, 

T 

- 7rJ'_i(/l:i_2)|| < CtUi-.t)- 

In this case a simple modification of the AOMD algorithm, where the $C_{(N)}$ terms are replaced by $C_{\Delta_N}(f_{k_N : k_{N+1}-1})$, leads to
the following corollary of Theorem 3.

Corollary 4. Assume that $D_{\mathcal{R}}(x,z) - D_{\mathcal{R}}(y,z) \le \gamma \|x - y\|$, $\forall x,y,z \in \mathcal{X}$. The AOMD algorithm with the
modification mentioned above achieves the following bound on regret

$$\mathrm{Reg}_T^{\Pi} \le \tilde{O}\big( \sqrt{D_T + 1} \big) + \tilde{O}\Big( \min\big\{ \sqrt{(D_T + 1)\, C_T(f_{1:T})},\ (D_T + 1)^{1/3} T^{1/3} V_T^{1/3} \big\} \Big).$$

The corollary naturally interpolates between the static and dynamic regret. In other words, letting $C_T(f_{1:T}) = 0$
(which holds for constant mappings), we recover the result of HD (up to logarithmic factors), whereas $C_T(f_{1:T}) = C_T$
simply recovers the regret bound in Theorem 3 corresponding to dynamic regret. The extra log factor is the
cost of adaptivity of the algorithm as we assume no prior knowledge about the environment.

B. Switching Zero-sum Games with Uncoupled Dynamics 

Consider two players playing $T$ zero-sum games defined by matrices $A_t \in [-1,1]^{m \times n}$ for each $t \in [T]$. We would
like to provide strategies for the two players such that, if both players honestly follow the prescribed strategies, the
average payoffs of the players approach the average minimax value for the sequence of games at some fast rate.
Furthermore, we would also like to guarantee that if one of the players (say the second) deviates from the prescribed
strategy, then the first player still has small regret against sequences of actions that do not change drastically. To
this end, one can use a simple modification of the AOMD algorithm for both players that uses KL divergence as
$D_{\mathcal{R}}$, and mixes in a bit of uniform distribution on each round, producing an algorithm similar to the one in HD
for unchanging uncoupled dynamic games. The following theorem provides bounds for when both players follow
the strategy, and a bound on regret for Player I when Player II deviates from the strategy.
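On round $t$, Player I performs

    Play $x_t$ and observe $f_t^\top A_t$
    Update

        $\hat{x}_t(i) \propto \hat{x}'_{t-1}(i) \exp\{ -\eta_t [f_t^\top A_t]_i \}$
        $\hat{x}'_t = (1 - \beta)\, \hat{x}_t + (\beta / n)\, \mathbf{1}_n$
        $x_{t+1}(i) \propto \hat{x}'_t(i) \exp\{ -\eta_{t+1} [f_t^\top A_t]_i \}$

and simultaneously Player II performs

    Play $f_t$ and observe $A_t x_t$
    Update

        $\hat{f}_t(i) \propto \hat{f}'_{t-1}(i) \exp\{ -\eta'_t [A_t x_t]_i \}$
        $\hat{f}'_t = (1 - \beta)\, \hat{f}_t + (\beta / m)\, \mathbf{1}_m$
        $f_{t+1}(i) \propto \hat{f}'_t(i) \exp\{ -\eta'_{t+1} [A_t x_t]_i \}$
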

Note that in the description of the algorithm as well as the following proposition and its proof, any letter with 
the prime symbol refers to Player II, and it is used to differentiate the letter from its counterpart for player I. 
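
One round of Player I's update can be sketched as follows: an exponential-weights step on the observed vector, mixing with the uniform distribution, and an optimistic step that reuses the same observation as the prediction for the next round. The function name `player_one_round`, its argument names, and the choice to pass both step sizes in as plain numbers are assumptions of this illustration; the step sizes of the proposition are not reproduced here.

```python
import numpy as np


def player_one_round(x_hat_mixed_prev, observed, eta_t, eta_next, beta):
    """One round of the exponential-weights update with uniform mixing (sketch).

    x_hat_mixed_prev : mixed secondary iterate from the previous round (on the simplex)
    observed         : the vector f_t^T A_t revealed to Player I (length n)
    """
    n = observed.shape[0]
    # Secondary iterate: multiplicative update on the observed losses.
    weights = x_hat_mixed_prev * np.exp(-eta_t * observed)
    x_hat = weights / weights.sum()
    # Mix in a little uniform mass so every coordinate stays away from zero.
    x_hat_mixed = (1.0 - beta) * x_hat + beta / n
    # Optimistic step: predict that the next observation equals the current one.
    next_weights = x_hat_mixed * np.exp(-eta_next * observed)
    x_next = next_weights / next_weights.sum()
    return x_hat_mixed, x_next


T = 10
x_mixed, x_next = player_one_round(
    np.full(3, 1.0 / 3.0), np.array([1.0, 0.0, 0.0]), eta_t=0.5, eta_next=0.5, beta=1.0 / T**2
)
```

The mixing step keeps every coordinate at least $\beta/n$, which is what bounds the logarithmic ratios that appear in the proof.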

Proposition 5. Define J^t — Si=i ~ IlL’ 

f ^ ^ ^ 

r]t = min < log(T^n) 


■\/ =^t-i + a/ ■^t-2 321/ J 


Also define ^t — Si=i ~ |1^, and let 

■q't = min < log(T^m) 


32L 


Let $\beta = 1/T^2$, $M_t = f_{t-1}^\top A_{t-1}$, and $M'_t = A_{t-1} x_{t-1}$. When Player I uses the prescribed strategy, irrespective of
the actions of Player II, the regret of Player I w.r.t. any sequence of actions $u_1,\ldots,u_T$ is bounded as :

Y, {fjAtXt - fjAtUt) <2\og{T^n) (Ct(ui, ■ ■ ■, ut) + 2) (^2L + + log(r2n)^ v^. 


t=i 

Further if both players follow the prescribed strategies then, as long as 

2Lf > max {Ct, C'j-} + 3, 

we get, 

T T T 

Y sup f7AtXt<Y fi^AtXt + ^^^ ++4:Y\\At-i-At 


( 10 ) 


+ 32L (log(r^n)C'T + log(T^m)C/ + 2 log(r'^nm)) 
A simple consequence of the above proposition is that if, for instance, the game matrix $A_t$ changes at most $K$
times over the T rounds, and we knew this fact a priori, then by letting L = , ^ , we get that regret for Player 

i/log(T2n) 

I w.r.t. any sequence of actions that switches at most K times even when Player II deviates from the prescribed 
strategy is O + 2) y/\og{T^n)T'^. At the same time if both players follow the strategy, then average payoffs 
of the players converge to the average minimax equilibrium at the rate of O {L {K + 2) log(T'^nm)) under the 
condition on $L$ given in (10). This shows that if the game matrix only changes/switches a constant number of times,
then players get an $O(\sqrt{\log(T)\, T})$ regret bound against arbitrary sequences and comparator actions that switch at most
$K$ times, while simultaneously getting a convergence rate of $O(\log(T))$ to average equilibrium when both players are
honest. Also, when we let $K = 0$ and set $L$ to some constant, the proposition recovers the rate in the static setting
GD where the matrix sequence is time-invariant.

V. Conclusion 

In this paper, we proposed an online learning algorithm for dynamic environments. We considered time-varying 
comparators to measure the dynamic regret of the algorithm. Our proposed method is fully adaptive in the sense that 
the learner needs no prior knowledge of the environment. We derive a comprehensive upper bound on the dynamic 
regret capturing the interplay of regularity in the function sequence versus the comparator sequence. Interestingly, 
the regret bound adapts to the smaller quantity among the two, and selects the best of both worlds. As an instance 
of dynamic regret, we considered drifting zero-sum, two-player games, and characterized the convergence rate to 
the average minimax equilibrium in terms of variability in the sequence of payoff matrices. 
Appendix: Proofs

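Proof of Lemma 1. For any $u_t \in \mathcal{X}$, it holds that

$$\langle x_t - u_t, \nabla_t \rangle = \langle x_t - \hat{x}_t, \nabla_t - M_t \rangle + \langle x_t - \hat{x}_t, M_t \rangle + \langle \hat{x}_t - u_t, \nabla_t \rangle. \tag{11}$$

First, observe that for any primal-dual norm pair we have

$$\langle x_t - \hat{x}_t, \nabla_t - M_t \rangle \le \|x_t - \hat{x}_t\| \, \|\nabla_t - M_t\|_*.$$

Any update of the form $a^* = \operatorname{argmin}_{a \in \mathcal{X}} \langle a, x \rangle + D_{\mathcal{R}}(a, c)$ satisfies for any $d \in \mathcal{X}$,

$$\langle a^* - d, x \rangle \le D_{\mathcal{R}}(d, c) - D_{\mathcal{R}}(d, a^*) - D_{\mathcal{R}}(a^*, c).$$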
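
The inequality for updates of this form can be sanity-checked numerically in the Euclidean case. The sketch below assumes $\mathcal{R}(x) = \tfrac{1}{2}\|x\|_2^2$, so that $D_{\mathcal{R}}(p, q) = \tfrac{1}{2}\|p - q\|_2^2$, and takes the box $[-1, 1]^4$ as the feasible set, where the argmin is a coordinate-wise clip. The function name `check_three_point` and the sampling scheme are illustrative assumptions.

```python
import numpy as np


def check_three_point(trials=1000, dim=4, seed=0):
    """Largest observed value of LHS - RHS of the update inequality (should be <= 0)."""
    rng = np.random.default_rng(seed)

    def breg(p, q):
        # Bregman divergence of R(x) = ||x||^2 / 2.
        return 0.5 * float(np.sum((p - q) ** 2))

    worst = -np.inf
    for _ in range(trials):
        c = rng.uniform(-1.0, 1.0, dim)        # centre of the proximal term
        d = rng.uniform(-1.0, 1.0, dim)        # arbitrary feasible comparator
        x = rng.normal(scale=2.0, size=dim)    # linear term
        # argmin over the box of <a, x> + ||a - c||^2 / 2 is the clipped point c - x.
        a_star = np.clip(c - x, -1.0, 1.0)
        lhs = float(np.dot(a_star - d, x))
        rhs = breg(d, c) - breg(d, a_star) - breg(a_star, c)
        worst = max(worst, lhs - rhs)
    return worst


worst_gap = check_three_point()
```

Up to floating-point error, the returned gap is never positive, in line with the stated inequality.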

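This entails

$$\langle x_t - \hat{x}_t, M_t \rangle \le \frac{1}{\eta_t} \Big\{ D_{\mathcal{R}}(\hat{x}_t, \hat{x}_{t-1}) - D_{\mathcal{R}}(\hat{x}_t, x_t) - D_{\mathcal{R}}(x_t, \hat{x}_{t-1}) \Big\}$$

and

$$\langle \hat{x}_t - u_t, \nabla_t \rangle \le \frac{1}{\eta_t} \Big\{ D_{\mathcal{R}}(u_t, \hat{x}_{t-1}) - D_{\mathcal{R}}(u_t, \hat{x}_t) - D_{\mathcal{R}}(\hat{x}_t, \hat{x}_{t-1}) \Big\}.$$

Combining the preceding relations and returning to (11), we obtain

$$\begin{aligned}
\langle x_t - u_t, \nabla_t \rangle &\le \frac{1}{\eta_t} \Big\{ D_{\mathcal{R}}(u_t, \hat{x}_{t-1}) - D_{\mathcal{R}}(u_t, \hat{x}_t) - D_{\mathcal{R}}(\hat{x}_t, x_t) - D_{\mathcal{R}}(x_t, \hat{x}_{t-1}) \Big\} + \|\nabla_t - M_t\|_* \|x_t - \hat{x}_t\| \\
&\le \frac{1}{\eta_t} \Big\{ D_{\mathcal{R}}(u_t, \hat{x}_{t-1}) - D_{\mathcal{R}}(u_t, \hat{x}_t) - \tfrac{1}{2} \|\hat{x}_t - x_t\|^2 - \tfrac{1}{2} \|\hat{x}_{t-1} - x_t\|^2 \Big\} + \|\nabla_t - M_t\|_* \|x_t - \hat{x}_t\|,
\end{aligned} \tag{12}$$

where in the last step we appealed to strong convexity: $D_{\mathcal{R}}(x, y) \ge \tfrac{1}{2} \|x - y\|^2$ for any $x, y \in \mathcal{X}$. Using the simple
inequality $ab \le \tfrac{\rho}{2} a^2 + \tfrac{1}{2\rho} b^2$ for any $\rho > 0$ to split the product term, we get

$$\begin{aligned}
\langle x_t - u_t, \nabla_t \rangle \le{} & \frac{1}{\eta_t} \Big\{ D_{\mathcal{R}}(u_t, \hat{x}_{t-1}) - D_{\mathcal{R}}(u_t, \hat{x}_t) - \tfrac{1}{2} \|\hat{x}_t - x_t\|^2 - \tfrac{1}{2} \|\hat{x}_{t-1} - x_t\|^2 \Big\} \\
& + \frac{\eta_{t+1}}{2} \|\nabla_t - M_t\|_*^2 + \frac{1}{2\eta_{t+1}} \|x_t - \hat{x}_t\|^2.
\end{aligned}$$

Applying the bound

$$\frac{1}{2\eta_{t+1}} \|x_t - \hat{x}_t\|^2 - \frac{1}{2\eta_t} \|x_t - \hat{x}_t\|^2 \le R_{\max}^2 \left( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \right)$$

and summing over $t \in [T]$ yields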

T T 


-f -f ^ 1 r 1 

^ {xt - ut,Vt) < ^ IIVt - MtWl +X! Vniut,xt-i) - Vn{ut,xt) ^ + 

t=i t=i t=i ^ 

f-+— 

^ 2 \rii rjT-t 


Ri 


Vt+1 


E 

i=2 


'DTiiut,xt-i) Vniut-i.xt-i) 


Vt 


Vt-i 


^E 

i=2 


f Vniut^xt-i) _ Vniut-i^xt-i) Vuiut-i^Xt-i) _ V-niut-i.xt-i) 
\ Vt Vt Vt Vt-1 


Vt+1 






i=2 




Vt 



' 1 


Jit 

-ut-i\\ 

+ 


1 1 


Vt Vt-1 


+ 


2i?; 


2 

max 


VT-\-l 


*^max 


i=2 


Vt 


VT+l 
where we used the Lipschitz continuity of $D_{\mathcal{R}}$ in the penultimate step. Now let us set

^ L (^/ElZo l|V. - M.llJ - l|V. - M.li;) 


Vt 


Elzl liv. - M4l + Jjzlil ||v. - M4l 


II Vt-i — Mt-i\ 


and II Vo — Mo||^, = 1 to have 


L I 




t=l \ s=l 


J2\\Vs-M4\l- 


. ^||V,-M,||^ 

\ s^O 


- X -^- X - 




< 2^ 1 + ^ ||V, - M,\\l ) . 

Appealing to convexity of $\{f_t\}_{t=1}^{T}$, and replacing $C_T$ (3) and $D_T$ (4) in the above, completes the proof.

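Proof of Lemma 2. We define

$$\mathcal{U}_T = \Big\{ u_1,\ldots,u_T \in \mathcal{X} : \gamma \sum_{t=1}^{T} \|u_t - u_{t-1}\| \le L^2 - 4R_{\max}^2 \Big\}, \tag{13}$$

and

$$(u_1^*,\ldots,u_T^*) = \operatorname{argmin}_{(u_1,\ldots,u_T) \in \mathcal{U}_T} \sum_{t=1}^{T} f_t(u_t).$$

Our choice of $L > 2R_{\max}$ guarantees that any sequence of fixed comparators $u_t = u$ for $t \in [T]$ belongs to $\mathcal{U}_T$, and
hence, $(u_1^*,\ldots,u_T^*)$ exists. Noting that $(u_1^*,\ldots,u_T^*)$ is an element of $\mathcal{U}_T$, we have
$\gamma \sum_{t=1}^{T} \|u_t^* - u_{t-1}^*\| + 4R_{\max}^2 \le L^2$.
We now apply Lemma 1 to $\{u_t^*\}_{t=1}^{T}$ to bound the dynamic regret for an arbitrary comparator sequence $\{u_t\}_{t=1}^{T}$ as
follows.

$$\begin{aligned}
\mathrm{Reg}_T^d(u_1,\ldots,u_T) &= \sum_{t=1}^{T} \big\{ f_t(x_t) - f_t(u_t^*) \big\} + \sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(u_t) \big\} \\
&\le 4\sqrt{1 + D_T}\, L + \sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(u_t) \big\} \\
&\le 4\sqrt{1 + D_T}\, L + \mathbf{1}\Big\{ \gamma \sum_{t=1}^{T} \|u_t - u_{t-1}\| > L^2 - 4R_{\max}^2 \Big\} \sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(u_t) \big\},
\end{aligned} \tag{14}$$

where the last step follows from the fact that

$$\sum_{t=1}^{T} f_t(u_t^*) - \sum_{t=1}^{T} f_t(u_t) \le 0 \quad \text{if } (u_1,\ldots,u_T) \in \mathcal{U}_T.$$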

Given the definition of $R_{\max}$ and strong convexity of $D_{\mathcal{R}}(x, y)$, we get that $\|x - y\| \le \sqrt{2} R_{\max}$ for any $x, y \in \mathcal{X}$.
This entails that once we divide the horizon into $B$ batches and use a single, fixed point as a comparator
along each batch, we have

$$\sum_{t=1}^{T} \|u_t - u_{t-1}\| \le B \sqrt{2} R_{\max}, \tag{15}$$

since there are at most $B$ changes in the comparator sequence along the horizon. Now let
$B = \dfrac{L^2 - 4R_{\max}^2}{\gamma \sqrt{2} R_{\max}}$,

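and for ease of notation, assume that $T$ is divisible by $B$. Noting that $f_t(x_t^*) \le f_t(u_t)$, we use an argument similar
to that of ifTdl to get for any fixed $t_i \in [(i-1)(T/B) + 1, i(T/B)]$,

$$\begin{aligned}
\sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(u_t) \big\} &\le \sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(x_t^*) \big\} && (16) \\
&= \sum_{i=1}^{B} \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \big\{ f_t(u_t^*) - f_t(x_t^*) \big\} \\
&\le \sum_{i=1}^{B} \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \big\{ f_t(x_{t_i}^*) - f_t(x_t^*) \big\} && (17) \\
&\le \frac{T}{B} \sum_{i=1}^{B} \max_{t \in [(i-1)(T/B)+1,\, i(T/B)]} \big\{ f_t(x_{t_i}^*) - f_t(x_t^*) \big\}. && (18)
\end{aligned}$$

Note that $x_{t_i}^*$ is fixed for each batch $i$. Substituting our choice of $B = \dfrac{L^2 - 4R_{\max}^2}{\gamma \sqrt{2} R_{\max}}$ in (15) implies that the comparator
sequence $u_t = x_{t_i}^*$ for $(i-1)(T/B) + 1 \le t \le i(T/B)$ belongs to $\mathcal{U}_T$, and (17) follows by optimality of $(u_1^*,\ldots,u_T^*)$. We
now claim that for any $t \in [(i-1)(T/B) + 1, i(T/B)]$, we have

$$f_t(x_{t_i}^*) - f_t(x_t^*) \le 2 \sum_{s=(i-1)(T/B)+1}^{i(T/B)} \sup_{x \in \mathcal{X}} |f_s(x) - f_{s-1}(x)|. \tag{19}$$

Assuming otherwise, there must exist a $\hat{t}_i \in [(i-1)(T/B) + 1, i(T/B)]$ such that

$$f_{\hat{t}_i}(x_{t_i}^*) - f_{\hat{t}_i}(x_{\hat{t}_i}^*) > 2 \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)|,$$

which results in

$$\begin{aligned}
f_t(x_{\hat{t}_i}^*) &\le f_{\hat{t}_i}(x_{\hat{t}_i}^*) + \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)| \\
&< f_{\hat{t}_i}(x_{t_i}^*) - \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)| \le f_t(x_{t_i}^*).
\end{aligned}$$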


The preceding relation for $t = t_i$ violates the optimality of $x_{t_i}^*$, which is a contradiction. Therefore, Equation (19)
holds for any $t \in [(i-1)(T/B) + 1, i(T/B)]$. Combining (16), (18) and (19) we have

$$\sum_{t=1}^{T} \big\{ f_t(u_t^*) - f_t(u_t) \big\} \le \frac{2T}{B} \sum_{i=1}^{B} \sum_{t=(i-1)(T/B)+1}^{i(T/B)} \sup_{x \in \mathcal{X}} |f_t(x) - f_{t-1}(x)| = \frac{2 T V_T}{B} = \frac{2 \gamma \sqrt{2} R_{\max} T V_T}{L^2 - 4R_{\max}^2}. \tag{20}$$

Using the above in Equation (14) we conclude the following upper bound

$$\mathrm{Reg}_T^d(u_1,\ldots,u_T) \le 4\sqrt{1 + D_T}\, L + \mathbf{1}\Big\{ \gamma \sum_{t=1}^{T} \|u_t - u_{t-1}\| > L^2 - 4R_{\max}^2 \Big\} \frac{4 \gamma R_{\max} T V_T}{L^2 - 4R_{\max}^2},$$

thereby completing the proof.


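Proof of Theorem 3. For the sake of clarity in presentation, we stick to the following notation for the proof

$$\underline{D}_{(i)} \triangleq D_{(i)} - \|\nabla_{k_{i+1}-1} - M_{k_{i+1}-1}\|_*^2, \qquad \underline{C}_{(i)} \triangleq C_{(i)} - \|x^*_{k_{i+1}-1} - x^*_{k_{i+1}-2}\|,$$

$$\underline{V}_{(i)} \triangleq V_{(i)} - \sup_{x \in \mathcal{X}} |f_{k_{i+1}-1}(x) - f_{k_{i+1}-2}(x)|, \qquad \underline{\Delta}_i \triangleq \Delta_i - 1,$$

for any doubling epoch $i = 1,\ldots,N$, where we recall that $k_{i+1} - 1$ is the last instance of epoch $i$. Therefore, any
symbol with lower bar refers to its corresponding quantity removing only the value of the last instance of that
interval.

Let the AOMD algorithm run with the step size given by Lemma 1 in the following form

$$\eta_t = \frac{L_i}{\sqrt{\sum_{s=0}^{t-1} \|\nabla_s - M_s\|_*^2} + \sqrt{\sum_{s=0}^{t-2} \|\nabla_s - M_s\|_*^2}},$$

and let $L_i$ be tuned with a doubling condition explained in the algorithm. Once the condition stated in the algorithm
fails, the following pair of inequalities must hold

$$\gamma \min\big\{ \underline{C}_{(i)},\ \underline{\Delta}_i^{2/3} \underline{V}_{(i)}^{2/3} \underline{D}_{(i)}^{-1/3} \big\} + 4R_{\max}^2 \le L_i^2 < \gamma \min\big\{ C_{(i)},\ \Delta_i^{2/3} V_{(i)}^{2/3} D_{(i)}^{-1/3} \big\} + 4R_{\max}^2. \tag{21}$$

Observe that the algorithm doubles $L_i$ only after the condition fails, so at violation points we suffer at most $2G$
by boundedness (6). Then, under purview of Lemma 2, it holds that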

47i?maxAjU^j^ 


N . 

Reg^ < ^ I 4y^L. + 1 {7G(,) > L? _ 


N 


Ll - ARL 


2NG 


( 22 ) 


< g +1 { 7 ., > 

where the last step follows directly from (I 2 TI 1 and the fact that < Df^iy Bounding y^Dy^Li in above, using 
the second inequality in (ISlT i. we get 

< y7min|0(i)G(i) , + 4i?g^^£)(i) 

< + V7min{^G(,)G(,) , , 

by the simple inequality yJa-\-h < ^/a + Vb. Plugging the bound above into (l22b and noting that 


N ^ I 


\ 


N 


-J2Dy^ = ^NDT + N, 
by Jensen’s inequality, we obtain

N 

< 2NG + SRuiaxy/ND t + N + 


2=1 


N 


2=1 


U(^) > Ai A(i) A(,) j- 


4-RmaxAill(i) 


{e,.„APF?'=£i;;«}^ 


where we used the first inequality in (l2Tli to bound the last term. Given the condition in the indicator function 
1 {•}, we can simplify above to derive, 


N 


< 2NG + S/^maxv/ NDt + N + 


2=1 


} 


+i: 1 {c,., > ’ 

2=1 

N ^ _ 

2NG + SRm^^VNDr + N + 4^7 y~^ rain |i ^\i) | 


+ 4i?n 


N 


2=1 


S} 


Du^Gu^ > Ay L>yvy^A,^/^ 


< 


2NG + 8Rm^^^/NDG^N + 4^7/min I 

2=1 

TV 

+ 4i?„,ax E min { } ■ 


i=l 


Given the fact that 


(23) 


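$$\underline{C}_{(i)} \le C_{(i)}, \qquad \underline{D}_{(i)} \le D_{(i)}, \qquad \underline{V}_{(i)} \le V_{(i)}, \qquad \underline{\Delta}_i \le \Delta_i,$$

we return to (23) to derive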


iV 


Reg^ < 2NG + 8R^^^^/NDt + N + (4^7 + 4i?n4ax) X! min { > ^(Y)^Ay 


2 = 1 


{ N _ N ^ 

Y. I] ■(" \ 

2=1 2=1 J 

< 2Af (g + + 1 + (2^7 + 2Rmax) min { ^/(At + l)Gr, {Dt + 1)^ 


■ (24) 


where we bounded the sums using the following facts about the summands:

$$C_{(i)} \le C_T, \qquad D_{(i)} \le D_T + 1, \qquad V_{(i)} \le V_T, \qquad \Delta_i \le T.$$
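The derivation above leans on two elementary facts: the subadditivity $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$ and a Jensen-type bound, presumably of the form $\sum_{i=1}^N \sqrt{d_i} \le \sqrt{N \sum_{i=1}^N d_i}$. The following is a minimal numerical sanity check of both, not part of the proof. The function names and the random per-epoch values (standing in for quantities such as $D_{(i)}$) are hypothetical.

```python
import math
import random


def sqrt_subadditivity_gap(a, b):
    """Return sqrt(a) + sqrt(b) - sqrt(a + b); nonnegative whenever a, b >= 0."""
    return math.sqrt(a) + math.sqrt(b) - math.sqrt(a + b)


def jensen_gap(values):
    """Return sqrt(N * sum(values)) - sum(sqrt(v)); nonnegative by concavity of sqrt."""
    n = len(values)
    return math.sqrt(n * sum(values)) - sum(math.sqrt(v) for v in values)


def worst_gaps(trials=1000, seed=0):
    """Smallest gaps observed over random nonnegative inputs (hypothetical data)."""
    rng = random.Random(seed)
    worst_sub, worst_jensen = float("inf"), float("inf")
    for _ in range(trials):
        a, b = rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0)
        worst_sub = min(worst_sub, sqrt_subadditivity_gap(a, b))
        # Random per-epoch values playing the role of D_(1), ..., D_(N).
        epoch_values = [rng.uniform(0.0, 100.0) for _ in range(rng.randint(1, 8))]
        worst_jensen = min(worst_jensen, jensen_gap(epoch_values))
    return worst_sub, worst_jensen
```

Both gaps should stay nonnegative up to floating-point error; the Jensen gap vanishes exactly when all per-epoch values coincide.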


To bound the number of batches $N$, we recall that $L_i = 3R_{\max}2^{i-1}$ and use the second inequality in (21) to bound $L_{N-1}$ as follows

N = 2 + log2(2^“^) = 2 + log2(LAr_i) - log2(3i?i„ax) 

< 2 + i log 2 ( 7 min + 4^max) “ log2(3i?max) 

< 2 + i log 2 (7C(iV-l) + - log2(3i?i„ax) 

< 2 + i log 2 ( 27 i?maxT + - log 2 (3i?i„ax) ■ 

In view of the preceding relation and (24), we have

Reg^ < + 1 + (2^7 + 2i?max) min { v'(-Dt + 1)Gt, ^ , 

where $\kappa = 4 + \log_2\!\left(2\gamma R_{\max} T + 4R_{\max}^2\right) - 2\log_2(3R_{\max})$, thereby completing the proof.
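The counting argument for the number of epochs can be illustrated with a short sketch. It assumes the indexing $L_i = 3R_{\max}2^{i-1}$ (read off from the relation $N = 2 + \log_2(L_{N-1}) - \log_2(3R_{\max})$) and treats `cap_sq` as a hypothetical upper bound on $L_{N-1}^2$, such as $2\gamma R_{\max}T + 4R_{\max}^2$. The function names are illustrative only.

```python
import math


def epoch_scale(i, r_max):
    """Scale L_i = 3 * R_max * 2**(i - 1) used in epoch i (i >= 1)."""
    return 3.0 * r_max * 2 ** (i - 1)


def count_epochs(cap_sq, r_max):
    """Largest N with L_{N-1}**2 <= cap_sq; assumes cap_sq >= (3 * r_max)**2."""
    i = 1
    # Keep doubling while the next scale still satisfies the cap.
    while epoch_scale(i + 1, r_max) ** 2 <= cap_sq:
        i += 1
    return i + 1  # i plays the role of N - 1


def epoch_bound(cap_sq, r_max):
    """Closed-form bound 2 + (1/2) log2(cap_sq) - log2(3 * R_max) on N."""
    return 2.0 + 0.5 * math.log2(cap_sq) - math.log2(3.0 * r_max)
```

For instance, with $\gamma = 1$, $R_{\max} = 1$ and $T = 1000$ the cap is $2004$, and the simulated count stays below the closed-form bound, which is what makes the number of epochs logarithmic in $T$.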


Proof of Proposition 5. Assume that Player I uses the prescribed strategy. This corresponds to using the optimistic mirror descent update with $\mathcal{R}(x) = \sum_{i=1}^n x_i \log(x_i)$ as the function that is strongly convex w.r.t. $\|\cdot\|_1$. Correspondingly, $\nabla_t = f_t^\top A_t$ and $M_t = f_{t-1}^\top A_{t-1}$. Following the line of proof in Lemma 1, in particular using Equation (12) for the specific case with $\mathcal{D}_{\mathcal{R}}$ as the KL divergence, we get that for any $t$ and any $u_t \in \Delta_n$,

fjAtXt - fjAtUt < - \ Pt - ^tWl - \ I 


+1|/7 \\xt- xtlli 


< 


rit 


\ogi- 


2=1 


i'S] 


-\\\Xt-Xt\\\-]^\\x't_^-Xt\^^ 


1 


+ /t + —maxlog 

m iG[n] 




Now let us bound, for some $i$, the term $\log\left(\hat{x}_t[i] / x'_t[i]\right)$. Notice that if $\hat{x}_t[i] \le x'_t[i]$ then the term is bounded by $0$. Now assume $\hat{x}_t[i] > x'_t[i]$. Letting $\beta = 1/T^2$, since $x'_t[i] = (1 - T^{-2})\hat{x}_t[i] + 1/(nT^2)$, we can have $\hat{x}_t[i] > x'_t[i]$ only when $\hat{x}_t[i] > 1/n$. Hence,


$$\log\frac{\hat{x}_t[i]}{x'_t[i]} = \log\frac{\hat{x}_t[i]}{(1 - T^{-2})\hat{x}_t[i] + 1/(nT^2)} \le \frac{2}{T^2}.$$
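The effect of mixing in a $1/T^2$ fraction of the uniform distribution can be checked numerically. The sketch below assumes the mixing rule $x' = (1 - T^{-2})\hat{x} + 1/(nT^2)$ from the text and verifies, for $T \ge 2$, that every coordinate of $x'$ is at least $1/(nT^2)$ and that $\log(\hat{x}[i]/x'[i]) \le 2/T^2$. The constant $2/T^2$ is one valid choice of bound, and the function names are hypothetical.

```python
import math


def mix_with_uniform(x_hat, horizon):
    """Return (1 - beta) * x_hat + beta / n with beta = 1 / horizon**2."""
    n = len(x_hat)
    beta = 1.0 / horizon ** 2
    return [(1.0 - beta) * p + beta / n for p in x_hat]


def max_log_ratio(x_hat, x_mixed):
    """Largest value of log(x_hat[i] / x_mixed[i]) over coordinates with x_hat[i] > 0."""
    return max(math.log(p / q) for p, q in zip(x_hat, x_mixed) if p > 0.0)
```

The log-ratio is positive only on coordinates where $\hat{x}[i] > 1/n$, and it is at most $-\log(1 - T^{-2})$, which is below $2/T^2$ whenever $T \ge 2$.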


Using this, we can conclude that


fjAtXt - fjAtUt X! ( ?r 


- \ \\xt-xt\\l - i \\xt_i-xt\\l 


2 1 


+ \\ft A - ft-iAt-i\\^ \\xt - Still + ^ —■ 
Summing over $t \in [T]$, we obtain that


^ {f^AtXt - fjAtUt) < log “ 1 “ I 


T T 

+X! “ /t-i^*-iiioo II®* “ ®*iii+^ y~! ~- 

t=i t=i '* 


Note that $1/\eta_t \le O(\sqrt{T})$, so, assuming $T$ is large enough,

^ {f^AtXt - fjAtUt) < I] wtW log (^7^) - I ll®t - ®tll? - I ll®Ui -xt\\\ 


+ XI ll-^*'"“ ■^t'-l"^‘-l|loo II®* “ ^tlll + 1- 


(25) 




Now note that we can rewrite the first sum in the above bound and get




m 


Xt-iW 


Er=i«i[*]log(s^) Er=i«i-i[*]log(57^) log(T2n) 


t=2 

T 


m 


Vt-1 


m 




Efci (**tW - Mt-iW)iog (-7^ 




Vt 


T n 


+ EE“*-iWlog ?7 


1 


1 1 


®t-iW/ \Vt r]t-i 


logjT^n) 

m 


t=2 i=l 

Since, by definition of $x'_{t-1}$, we are mixing in $1/T^2$ of the uniform distribution, we have that for any $i$, $x'_{t-1}[i] \ge 1/(nT^2)$, and, since the $\eta_t$'s are non-increasing, we continue bounding above as


E ^ E ^* w log (^ 7 ^) < iog(7’'n) XI 


t=i ^* i=i 


m 


Xt-iW 


ll**t-i - **tlli 


t=2 


m 


+ \og{T^n)Y,(-- 


1 1 


t=2 V’^* 


log(r^n) 

m 


< \og{T^n) XI 


ll**t-i -utlli 


1 \ log(r^n) 


\t^2 
/ T 


< \og{T'^n) XI 


m rjT m 

\\ut-i - Utll^ 1 \ 


m 


m 


rjT 
Using the above in Equation (25), we get


flMxt - fjAtUt 


< i„g(TV) y: tizipA 1 i p, _ ,,|i; _ 1 ^ 1 Ii; + 1 


m 






+ Y ll/M - /t-iA-iL 11^* - ^*lli + 




Vt 

1 


< — + _ 1 ^i ||i, __ t ^± _,,||; 


?7r 


2 —' ??t 
t=i 




+ H “ /t-l^«-lL ll^^* “ ^*lll ■ 

t=l 

Notice that our choice of step size, given by
rjt = min ^ log(r^n) 


,2 ’ 32L 

I OO 


= min l^log(r^n) 
guarantees that 


^JEtl ll/M - f7-iA.-i\L + WflA. - f7.,A-i\ 

L {^JYYl WfjA, - /r-i^.-i|iL - WfjA, - /r-i^.-i|iL) ^ 


“ f7-2At-2 


’ 32L 


= max • 


flAi f7-iAi-i OO + \j71i=i flAi f7-iAi-i 


log(T^n)L 

Using the step size specified above in the bound (26), we get

T T 

Y Atxt - Y f7Atut 


,32LV. 


t=i 


t=i 


< log(T^n) (Ct(mi, ... jUt) + 2) 


2\/St=i fJAt f7 iAt-i 


log(T 2 n)L 


-f 32L 


Now note that, by the triangle inequality, we have

||/7 At — f7-iAt-i\\^ = \\f7 At — fJ At-i + fJ At-i — f7-iAt-i\ 
< \\At-i - Alloo + ll/t “ /t-illi 


< \\At-l - ^tlloo + 


ft — ft-i 


ft-1 — ft-^ 


(26) 


(27) 


+ Y ~ /t-i^t-i|loo ll*‘ “ ^‘lli “ ||it - Still - ^^^Y p't-i - Xt\\\ . (28) 


since the entries of the matrix sequence $\{A_t\}_{t=1}^T$ are bounded by one. Using the bound above in (28) and splitting the product term, we see that

T 

(/7 ^tXt - /7 AtUt^ < log(r^n) (Ct(mi, ...,ut) + 2) 




log(T^n)L 

T T T 

+ 2^^ \\At - At_i||^ -8L'^ lift - Xt\\l - 16Ly^ - ®t||i 

t=l t=l t=l 

^ T” 2 ^ 2 


+ 32L 


+ 


(29) 


where we used the simple inequality $ab \le \frac{\rho}{2}a^2 + \frac{1}{2\rho}b^2$ for $\rho > 0$.
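The two elementary steps used in this part of the argument can be sanity-checked numerically. The sketch below assumes that $f_t$ lies in the simplex, that the entries of $A_t$ lie in $[-1, 1]$, and that $\|A\|_\infty$ denotes the largest absolute entry. It checks the triangle-type bound $\|f_t^\top A_t - f_{t-1}^\top A_{t-1}\|_\infty \le \|A_{t-1} - A_t\|_\infty + \|f_t - f_{t-1}\|_1$ and the product-splitting inequality $ab \le \frac{\rho}{2}a^2 + \frac{1}{2\rho}b^2$, taken here as one standard form of that bound. All helper names are hypothetical.

```python
import random


def vec_mat(f, a):
    """Row vector f^T A for f of length m and A of shape m x n."""
    return [sum(f[i] * a[i][j] for i in range(len(f))) for j in range(len(a[0]))]


def gradient_change(f_new, a_new, f_old, a_old):
    """Sup-norm of f_new^T A_new - f_old^T A_old."""
    g_new, g_old = vec_mat(f_new, a_new), vec_mat(f_old, a_old)
    return max(abs(p - q) for p, q in zip(g_new, g_old))


def triangle_bound(f_new, a_new, f_old, a_old):
    """Max-entry norm of A_new - A_old plus the l1 norm of f_new - f_old."""
    max_entry = max(
        abs(x - y) for r_new, r_old in zip(a_new, a_old) for x, y in zip(r_new, r_old)
    )
    return max_entry + sum(abs(p - q) for p, q in zip(f_new, f_old))


def young_gap(a, b, rho):
    """(rho / 2) a^2 + b^2 / (2 rho) - a b, nonnegative for rho > 0."""
    return 0.5 * rho * a * a + b * b / (2.0 * rho) - a * b


def random_simplex_point(rng, m):
    """Random probability vector of length m (hypothetical test data)."""
    weights = [rng.random() + 1e-9 for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


def random_payoff_matrix(rng, m, n):
    """Random m x n matrix with entries in [-1, 1] (hypothetical test data)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
```

The product-splitting gap equals $\frac{1}{2\rho}(\rho a - b)^2$, so it vanishes exactly when $b = \rho a$.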

a) When Player II follows the prescribed strategy: In this case we would like to get convergence of payoffs to the average value of the games. To get this, using the notation $x_t^* = \operatorname{argmin}_{x_t \in \Delta_n} f_t^\top A_t x_t$ and denoting the corresponding sequence regularity for Player I by $C_T$, we get


1 

(/7 Atxt - f7Atxi^ < log(r^n) (Ct + 2) 


‘2\lT.UWAt-fJ_,At 

\og{T^n)L 


+ 32L 


+ 2^ IIAt - At-i||^ - 8L^ lixt - a:t||J - 16L'^ pt-i “ ^t||i 

t=i t=i 

^ Ik* “ •^*”*lli ^ ^ Ik* “ -^*111' 


+ 


t=l t=l 

where the term i appeared in the last line comparing to ( |29] | is due to 


t=i 

1 

4L’ 


1 II •' " 1 II - 

or, ^ lk‘“^ ~ 1 ~ 16 L ^ Ik* ~ 


16L 


t=i 


t=i 


2 1 
< -. 

1 - 4L 


Using the same bound for Player II (using loss as $-f_t^\top A_t x_t$ on round $t$), as well as using $f_t^* = \operatorname{argmin}_{f_t \in \Delta_m} -f_t^\top A_t x_t$ and denoting the corresponding sequence regularity by $C'_T$, we have that

E (dr + 2) + 32^^ 

T T T 

-2EllA-At-i|L+8LE|k‘--^‘|li + ^®^E|k*-*"-^*lli 


1 . ||2 1 ||2 1 
- E ll*‘ - - TfiT ^ II** - **lli - AT.- 


16 L 


AL 


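Combining the two and noting that

$$f_t^{*\top} A_t x_t = \sup_{f_t \in \Delta_m} f_t^\top A_t x_t \ge \inf_{x_t \in \Delta_n} \sup_{f_t \in \Delta_m} f_t^\top A_t x_t = \sup_{f_t \in \Delta_m} \inf_{x_t \in \Delta_n} f_t^\top A_t x_t \ge \inf_{x_t \in \Delta_n} f_t^\top A_t x_t = f_t^\top A_t x_t^*,$$

we get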


T T T 

2 ^ 6 /” 1 ^ 

V sup fi^AtXt^y] inf sup + ^—+ — + 4 V P* - ^_i 

^x,eA„/,eA„ T 2 L ^ 


+ log(TP) (Ct + 2) 


+ log(T 2 m) (C^ + 2 ) 


f7fJ-iAt-} 


log(r 2 n)L 


+ 32L 


log(T2m)L 


+ 32i 


("— - 
U 6 L 


T" T" 

8 -^) ^ Pi “ ^ 

- P E l|/< - ^ (4 - ip E ll/- - /' 


V16L " / ^ 11 '"' ■''■ 1 V16E 

^ t=i ^ ^ t=i 

where the constant $256L/T$ appearing in the first line accounts for the inequalities


\\xt-i-Xt\\l-\\x[_-i^-Xt\\\< ^ 


2^2 


ft-1 - ft 


ft-1 - ft 


2 8 
1 - T^' 


Using the triangle inequality again,


|2 
I oo 




\\f7^t - /t^l^t-l||oo “ X] "^*-1 “ /t^l^t-l| 

t=l 

- w^t-i - ^tiiL+ii/t - ft-i\\i 

t=i t=i 

T T ^ T 

E 2^^ \\At-i - AIIL + 4^^ ||/t - ft-1 + 4^^ ||/t_i - ft-1 


t=i 


t=i 


t=i 


which also implies 


\j:\\ftAt-ff-.iAt.i\\i< 

\ t=l 


A 2^^ Pt-I - ^t||L + 4^^ ||/t - /t-i|| +4^^ ||/t-i - ft-1 
\ t=l t=l t=l 


< 2 , 


Y^\\At.i-At\\l + 2, 




||/t - 




< 2 , 


(30) 


(31) 


y] - AtWl + 2 + 2^] /t - ft-i\\ +2j2\\ft-i - ft- 




< 2 , 


t=l 

T 


t=l 

T 




t=l 


Pt-i - + 10 + 2^] \\ft - ft-i\l + 2j2\\ft - ft\l . (32) 


t = l 
where we used the bound $\sqrt{c} \le c + 1$ for any $c \ge 0$ in the penultimate line. Similar bounds as Equations (31) and (32) hold for the other player as well. Using them in the bound above, after some calculations, we conclude that


T T T 

^ sup ftAtXt<'f2 f7 + +-^ + At\\ 


+ 32L( log(T^n)CT + log(r^m)CT + 2 log(T"^nm)) + (Ct + C't + 4) 
{ Ct + 3 


20 + 4y^ELi \\At-i-At 


+ 4 


i^- 2 E E||A-/i+ 2 E 

^ / \t=l t=l 


-1 - ft 




b) When Player II is dishonest: In this case we would like to bound Player I’s regret regardless of the strategy adopted by Player II. Dropping one of the negative terms in Equation (26), we get

VT 
T 


f: (pA.^. - flA,u.) < Mr^n)(CTK....,^T) + 2) _ 1 1II _ ^_||. 

^ t = l 


+ “ /t-l^‘-l|loo ll*‘ “ ®‘lll 

log(T2n)(CT(ui,...,UT) + 2) ||2 

- -- 


E 


Vt+1 II .T 


At - fJ_j^At-i\\^ 2 E - - 11 ^* “ • 

i=i '*+1 


(33) 


Noting the telescoping sum




2 7^1 

as well as the choice of step size (27), which entails




\Vt+l Vt J ^T+1 ’ 


E 


Vt+1 II 


At - < log(T2n)^E 


2 

< iog(r^n)4 


t=l \ i=l 


E 11/7^2 


t-1 




|2 
I oo 


J2\\+At-fl,At 

\ t=i 


we bound (33) to obtain

T 


y: (/221 ,x, - ,7 An,) < .+ ^ + iog(TS.) 

^ VT VT+1 


L 


J2\\+At-f7-iAt-i\ 

\ t=i 


< 2 log(r^n) (Ct(ui, ..., ut) + 2) 32L -f 


‘2JEi=i\W At - f7-iAt-i 


log{T'^n)L 


L 


+ Ell 

\ t^l 


|2 

I oo' 
A similar statement holds for Player II: her/his payoff converges at the provided rate to the average minimax equilibrium value. ■